3Vs in Big Data

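Big Data refers to data that is too large or too complex for a single computer or traditional tools to process. It is usually described by the 3Vs: volume, variety, and velocity.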




Extracted from: https://financesonline.com/what-is-big-data-analytics-and-how-it-helps-you-understand-your-customers/

Volume: The amount of data is very large.
Variety: The data comes in many types (structured, semi-structured, and unstructured).
Velocity: New data arrives very quickly.

MapReduce

When data is too large for one machine, we use MapReduce. As I used to say while teaching at Kaplan and the University of Portsmouth, it is common sense: when one computer cannot process the data, we use many computers to process it.




Extracted from: https://www.edureka.co/blog/mapreduce-tutorial/

When the data is large, we split it into smaller chunks and send the chunks to the mapper nodes. Each node is a computer.
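The splitting step can be sketched in plain Python. This is only an illustration: the function name `split_into_chunks`, the sample records, and the choice of three nodes are assumptions made for the example, and a real Hadoop cluster splits input by size rather than by record count.

```python
from typing import List, Sequence


def split_into_chunks(records: Sequence[str], num_nodes: int) -> List[List[str]]:
    """Split records into num_nodes roughly equal chunks, one per mapper node."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    chunk_size, remainder = divmod(len(records), num_nodes)
    chunks = []
    start = 0
    for node in range(num_nodes):
        # The first `remainder` nodes take one extra record each.
        end = start + chunk_size + (1 if node < remainder else 0)
        chunks.append(list(records[start:end]))
        start = end
    return chunks


# Hypothetical input: seven records shared among three mapper nodes.
records = ["r1", "r2", "r3", "r4", "r5", "r6", "r7"]
chunks = split_into_chunks(records, 3)
for node_id, chunk in enumerate(chunks):
    print(f"mapper node {node_id} gets {chunk}")
```

Every record lands in exactly one chunk, so the mapper nodes can work on their chunks independently and in parallel.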

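In the map phase, each mapper processes its own chunk of data in parallel and emits intermediate key-value pairs.

The intermediate pairs are then shuffled and sorted so that all values with the same key reach the same reducer. In the reduce phase, each reducer aggregates those values and sends the result to the output.

Note: The mappers do the per-record processing, and the reducers do the aggregation.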
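The map, shuffle, and reduce steps can be sketched as a word count in plain Python running on one machine. This is a minimal illustration, not Hadoop's actual API: the function names `mapper`, `shuffle`, and `reducer` and the two sample chunks are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mapper(chunk: Iterable[str]) -> List[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    pairs = []
    for line in chunk:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle(mapped: Iterable[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """Shuffle step: group the values from all mappers by key."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(grouped)


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: aggregate all values that share one key."""
    return key, sum(values)


# Hypothetical input: two chunks, as if sent to two mapper nodes.
chunks = [
    ["big data is big"],
    ["data is fast"],
]
mapped = [mapper(chunk) for chunk in chunks]   # would run in parallel on a cluster
grouped = shuffle(mapped)
word_counts = dict(reducer(key, values) for key, values in grouped.items())
print(word_counts)
```

On a real cluster the mappers run on different computers and the shuffle moves data across the network, but the flow of key-value pairs is the same.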

Hadoop Ecosystem




Extracted from: https://data-flair.training/blogs/hadoop-ecosystem-components/

Hive is for SQL-like queries (HiveQL).
Mahout is for machine learning.
HBase is a column-oriented NoSQL store.
HDFS (Hadoop Distributed File System) is where the data is stored.

Apache Spark

Apache Spark is another very popular Big Data system. Spark is usually faster than Hadoop MapReduce because it keeps data in memory where possible, and it includes a machine learning library (MLlib).

Apache Spark can also run as part of the Hadoop ecosystem:




Extracted from: https://www.edureka.co/blog/hadoop-ecosystem

Tutorial: Big Data System

Go to Databricks.com and click Login.




Click Sign in here.







Select Import & Explore Data.




Upload iris.csv, then click Create Table in Notebook.







Cmd 2 Change to something like this.




Cmd 3 Change to something like this.




Cmd 4 Change to something like this.




Click to add more cells.




Cmd 5 Change to something like this.




Cmd 6 Change to something like this.




Cmd 7 Change to something like this.




Go to Cmd 2 and run all the cells below it (Run All Below).